Linear regression using numpy
Linear Regression¶
Linear Regression is a statistical approach for modelling the relationship between a scalar response (also known as the dependent variable) and one or more explanatory variables (also known as independent variables). It is one of the most basic building blocks of Machine Learning algorithms.
In most Python libraries, the linear regression model is available as a black box. However, it is imperative to understand what goes on under the hood in order to have a better grasp of the algorithm.
This article focuses on implementing vanilla linear regression from scratch using numpy and pandas.
Let's start by importing some basic Python libraries.
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
The notation that we will be using throughout is as follows -
- $X$ : input data
- $Y$ : target labels
- $W$ : weights vector (to be estimated using the training data)
- $b$ : intercept (bias) term
The notation for tracking dimensions is as follows -
- $m$ : number of training examples
- $n$ : number of input features
- $k$ : size of the labels vector (for a single-output problem such as this one, $k = 1$)
Load Diabetes dataset provided by sklearn¶
Diabetes dataset ¶
Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.
Data Set Characteristics:
:Number of Instances: 442
:Number of Attributes: First 10 columns are numeric predictive values
:Target: Column 11 is a quantitative measure of disease progression one year after baseline
:Attribute Information:
- Age
- Sex
- Body mass index
- Average blood pressure
- S1
- S2
- S3
- S4
- S5
- S6
Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times n_samples
(i.e. the sum of squares of each column totals 1).
Source URL: http://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
For more information see: Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499. (http://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
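The same description ships with the dataset object itself. As a quick optional check (a minimal sketch, relying only on the scikit-learn import above), we can confirm the shapes before wrapping the loading step in a helper:
# Optional: peek at the raw dataset before building anything on top of it.
diabetes_raw = datasets.load_diabetes()
print(diabetes_raw['data'].shape)     # (442, 10) feature matrix
print(diabetes_raw['target'].shape)   # (442,) disease-progression response
print(diabetes_raw['DESCR'][:200])    # start of the bundled description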
def load_diabetes_data():
    # Load the diabetes dataset and reshape the target into an (m, 1) column vector.
    diabetes = datasets.load_diabetes()
    X, Y = diabetes['data'], diabetes['target'].reshape(-1, 1)
    return X, Y
Helper functions for scaling data.¶
We shall leverage the existing MinMaxScaler provided by scikit-learn.
def get_scale_object(data):
    # Fit a MinMaxScaler so that each feature can be rescaled to the [0, 1] range.
    scaler = MinMaxScaler()
    scaler.fit(data)
    return scaler
def get_scaled_data(scaler, data):
    # Apply a previously fitted scaler to the data.
    return scaler.transform(data)
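As a quick illustration of what these helpers do (a toy example, not part of the diabetes pipeline), MinMaxScaler rescales every column independently to the [0, 1] range:
# Toy check: each column is mapped to [0, 1] independently.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 30.0]])
toy_scaler = get_scale_object(toy)
print(get_scaled_data(toy_scaler, toy))
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]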
While implementing Linear Regression from scratch, it is important to keep track of dimensions.
For this tutorial, we shall use the following dimensions -
\begin{align} X & = m \times n \tag{1}\label{1} \\ Y & = m \times 1 \tag{2}\label{2} \\ W & = 1 \times n \tag{3}\label{3} \\ b & = 1 \times 1 \tag{4}\label{4} \end{align}
The pseudo-code for training a Linear Regression model is as follows -
Step 1 -¶
Given a training dataset X and the corresponding labels Y, initialize the number of training examples (m), the number of features (n) and the number of output units (k), with dimensions as given in equations \ref{1}, \ref{2}, \ref{3} and \ref{4}.
Step 2 -¶
Initialize Weights matrix W and bias b using dimensions from Step 1.
Step 3 - Forward Propagation -¶
Step 3.1 - Compute the predictions using \ref{5}¶
\begin{align} y_{pred} = h(\theta) = XW^T + b \tag{5}\label{5} \end{align}
Let's validate that the dimensions match -
\begin{equation} dim(y_{pred}) = (m \times n) \times (1 \times n)^\intercal + (1 \times 1) = (m \times 1) \end{equation}
The $(m \times n)$ input multiplied by the transposed $(1 \times n)$ weights gives an $(m \times 1)$ vector, and the $(1 \times 1)$ bias $b$ is broadcast across all $m$ rows.
We are good to move ahead.
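To make the shape bookkeeping concrete, here is a minimal sketch (toy shapes chosen purely for illustration) showing that $XW^T + b$ broadcasts to an $m \times 1$ vector of predictions:
# Toy shape check for the forward pass y_pred = X W^T + b.
m_toy, n_toy = 5, 3
X_toy = np.random.rand(m_toy, n_toy)   # (m, n)
W_toy = np.zeros((1, n_toy))           # (1, n)
b_toy = np.zeros((1, 1))               # (1, 1)
print((np.dot(X_toy, W_toy.T) + b_toy).shape)   # (5, 1), i.e. m x 1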
Step 3.2 - Compute the loss / error / cost using \ref{6}¶
\begin{align} cost = \frac{1}{2m} \sum_{i =1}^{m} (y_{true} - y_{pred})^2 \tag{6} \label{6} \end{align}The above equation is the 'mean squared error', with the conventional factor of $\frac{1}{2}$ that cancels neatly when we differentiate. It is important to note that the cost / error / loss is a scalar which we would like to minimize using gradient descent.
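A tiny worked example (toy numbers, purely illustrative) confirms the cost is a single scalar and that we sum the squared errors rather than squaring the sum:
# Toy cost check with m = 3 predictions vs. targets.
t_true = np.array([[1.0], [2.0], [3.0]])
t_pred = np.array([[1.5], [2.0], [2.0]])
errors = t_pred - t_true                        # [[0.5], [0.0], [-1.0]]
cost_toy = (1.0 / (2 * 3)) * np.sum(errors**2)  # (0.25 + 0.0 + 1.0) / 6
print(cost_toy)                                 # ~0.2083, a scalar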
Step 4 - Back propagation -¶
The back propagation step updates the weights so that the MSE loss is minimized. Let's dig into the nuts and bolts of back propagation.
We shall derive the gradients using the chain rule.
Let's start by computing the partial derivative $\frac{\partial L}{\partial h(\theta)}$ -
\begin{align} \frac{\partial L}{\partial h(\theta)} & = \frac{\partial }{\partial h(\theta)} \Big[ \frac{1}{2m} \sum_{i =1}^{m} (y_{true} - h(\theta))^2 \Big] \end{align}
For one training example $i$ (dropping the $\frac{1}{m}$ factor for now; it is restored when we average over all $m$ examples) -
\begin{align} \frac{\partial L}{\partial h(\theta)} & = \frac{1}{2} \frac{\partial}{\partial h(\theta)} \Big[(y_{true} - h(\theta))^2 \Big]\\ & = \frac{1}{2} \cdot 2 \Big[ y_{true} - h(\theta)\Big] \cdot (-1) \\ & = -\Big[ y_{true} - h(\theta)\Big] \\ & = \Big[h(\theta) - y_{true} \Big] \tag{9} \label{9} \end{align}Now, the last part of this derivation is to compute the gradients with respect to the weights / coefficients.
\begin{align}\\ \frac{\partial L}{\partial W} & = \frac{\partial L}{\partial h(\theta)} \frac{\partial h(\theta)}{\partial W} \\ & = \frac{\partial L}{\partial h(\theta)} \Big[ \frac{\partial h(\theta)}{\partial W_1} \ \frac{\partial h(\theta)}{\partial W_2} \ .... \ \frac{\partial h(\theta)}{\partial W_n}\Big] \end{align}The derivatives $\Big[ \frac{\partial h(\theta)}{\partial W_1} \ \frac{\partial h(\theta)}{\partial W_2} \ .... \ \frac{\partial h(\theta)}{\partial W_n}\Big]$ can be easily calculated as follows -
\begin{align} \\ \frac{\partial h(\theta)}{\partial W_1} & = \frac{\partial }{\partial W_1} \Big(x_1w_1 + x_2w_2 + .... + x_nw_n + b \Big) \\ & = x_1 \end{align}Similarly, \begin{align} \frac{\partial h(\theta)}{\partial W_2} = x_2, \quad \frac{\partial h(\theta)}{\partial W_3} = x_3, \quad ...., \quad \frac{\partial h(\theta)}{\partial W_n} = x_n \end{align}
Now, the gradient of the bias term $b$ is simply - \begin{align}\\ \frac{\partial L}{\partial b} & = \frac{\partial L}{\partial h(\theta)} \frac{\partial h(\theta)}{\partial b} \\ & = (h(\theta) - Y^{(i)}) \frac{\partial }{\partial b} \Big(x_1w_1 + x_2w_2 + .... + x_nw_n + b \Big) \\ & = (h(\theta) - Y^{(i)}) \ (1) \end{align}
Therefore, for one training example, the gradients with respect to the weights and the bias can be concisely computed using -
\begin{align} \\ \frac{\partial L}{\partial W} = (h(\theta)-Y^{(i)}) [x_1^{(i)} x_2^{(i)} x_3^{(i)} .... x_n^{(i)}] \\ \frac{\partial L}{\partial b} = (h(\theta)-Y^{(i)}) \end{align}By extending the computations for one training example to m training examples, we get -
\begin{align} \\ \frac{\partial L}{\partial W} = \frac{1}{m} \sum_{i =1}^{m} \frac{\partial {L^{(i)}}}{\partial W} \tag{13}\label{13} \\ \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i =1}^{m} \frac{\partial {L^{(i)}}}{\partial b} \tag{14}\label{14} \end{align}Equations \ref{13} and \ref{14} can be written in vectorised form as follows -
\begin{align} \\ \frac{\partial L}{\partial W} = \frac{1}{m}(h(\theta) - Y)^T X \tag{15}\label{15} \\ \frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i =1}^{m} (h(\theta)^{(i)} - Y^{(i)}) \tag{16}\label{16} \end{align}
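As an optional spot-check of equation \ref{15} (a minimal sketch on toy data, independent of the notebook's pipeline), we can compare the analytical gradient against a central finite-difference approximation:
# Numerical spot-check of dL/dW from equation (15) on toy data.
def toy_cost(W_, b_, X_, Y_):
    preds = np.dot(X_, W_.T) + b_
    return (1.0 / (2 * X_.shape[0])) * np.sum((preds - Y_)**2)

rng = np.random.RandomState(0)
X_t, Y_t = rng.rand(6, 2), rng.rand(6, 1)
W_t, b_t = rng.rand(1, 2), rng.rand(1, 1)

preds_t = np.dot(X_t, W_t.T) + b_t
analytic_dw = (1.0 / X_t.shape[0]) * np.dot((preds_t - Y_t).T, X_t)  # equation (15)

eps = 1e-6
numeric_dw = np.zeros_like(W_t)
for j in range(W_t.shape[1]):
    W_plus, W_minus = W_t.copy(), W_t.copy()
    W_plus[0, j] += eps
    W_minus[0, j] -= eps
    numeric_dw[0, j] = (toy_cost(W_plus, b_t, X_t, Y_t) - toy_cost(W_minus, b_t, X_t, Y_t)) / (2 * eps)

print(np.allclose(analytic_dw, numeric_dw))   # expect True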
Step 4.1 - The weights can be updated using the gradient descent equations below, where $\alpha$ is the learning rate -¶
\begin{align} \\ W = W - \alpha \frac{\partial L}{\partial W} \tag{17} \label{17} \\ b = b - \alpha \frac{\partial L}{\partial b} \tag{18} \label{18} \end{align}Implement Step 1 -¶
raw_X, Y = load_diabetes_data()
scaler = get_scale_object(raw_X)
X = get_scaled_data(scaler, raw_X)
Helper functions for initializing vectors¶
def initialize_dimensions(X):
    # m training examples, n features.
    m, n = X.shape
    return m, n
def initialize_weight_vectors(dim):
    # Start with all-zero weights and bias; gradient descent will move them.
    W = np.zeros(shape=dim)
    b = np.zeros(shape=(1, 1))
    return W, b
Initialize weights vector¶
m, n = initialize_dimensions(X)
W, b = initialize_weight_vectors((1, n))
Functions for computing the cost and the gradients -¶
def compute_regression_cost(y, y_pred, m):
    # Mean squared error cost from equation (6): sum of squared errors, scaled by 1/(2m).
    cost = (1.0 / (2 * m)) * np.sum((y_pred - y)**2)
    return cost
def compute_regression_gradients(y_pred, y, X, m):
    # Vectorised gradients from equations (15) and (16).
    dw = (1.0 / m) * np.dot((y_pred - y).T, X)                    # shape (1, n)
    db = (1.0 / m) * np.sum(y_pred - y, axis=0, keepdims=True)    # shape (1, 1)
    return dw, db
def propagation(w, b, X, Y):
    m = X.shape[0]
    # FORWARD PROPAGATION (FROM X TO COST)
    z = np.dot(X, w.T) + b
    cost = compute_regression_cost(Y, z, m)
    # BACKWARD PROPAGATION (TO FIND GRAD)
    dw, db = compute_regression_gradients(z, Y, X, m)
    cost = np.squeeze(cost)
    grads = {"dw": dw,
             "db": db}
    return grads, cost
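Before moving on to the training loop, a quick smoke test (optional, reusing the X, Y, W and b defined above) confirms the gradients come back with the expected shapes:
# Smoke test: gradients should match the shapes of W and b.
test_grads, test_cost = propagation(W, b, X, Y)
print(test_grads['dw'].shape)   # expect (1, n), i.e. (1, 10)
print(test_grads['db'].shape)   # expect (1, 1)
print(float(test_cost))         # the cost is a single scalar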
def train_linear_regression(X, Y, learning_rate=0.2, n_epochs=4000, print_cost=True):
    m, n = initialize_dimensions(X)
    print("There are %d features and %d training examples" % (n, m))
    w, b = initialize_weight_vectors((1, n))
    print("W has a shape of %d rows and %d columns" % (w.shape[0], w.shape[1]))
    print("b has a shape of %d rows and %d columns" % (b.shape[0], b.shape[1]))
    costs = []
    for iteration in range(n_epochs):
        grad, cost = propagation(w, b, X, Y)
        dw = grad['dw']
        db = grad['db']
        # Gradient descent update (equations 17 and 18)
        w = w - learning_rate * dw
        b = b - learning_rate * db
        # Record the cost every 10 iterations
        if iteration % 10 == 0:
            costs.append(cost)
        # Print the cost and RMSE every 100 iterations
        if print_cost and iteration % 100 == 0:
            y_pred = np.dot(X, w.T) + b
            rmse = np.sqrt(np.mean((y_pred - Y)**2))
            print("Cost after iteration %i: %f | rmse after iteration %i: %f" % (iteration, cost, iteration, rmse))
    params = {"w": w,
              "b": b}
    grads = {"dw": dw,
             "db": db}
    return params, grads, costs
params, _, _ = train_linear_regression(X, Y, print_cost=True)
y_predicted = np.dot(X, params["w"].T) + params["b"]
rmse = np.sqrt(np.mean((y_predicted - Y)**2))
ssr = np.mean((y_predicted - Y)**2)     # mean squared residual
sst = np.mean((np.mean(Y) - Y)**2)      # variance of the targets
R2_score = 1.0 - (ssr / sst)
print("Model R2 score is %f" % R2_score)
Incredible !!¶
We have successfully trained a linear regression model from scratch.